For this blog post, I use the {modeltime} package to forecast 3 months of daily sales in Q1 for a young ‘Superstore’ company selling furniture, technology, and office supplies. The forecast can then be used to make decisions about supply-chain orders, warehouse inventory, and if/when new employees are needed to meet predicted sales demand after the holiday season.
The Superstore Sales Dataset on Kaggle is relatively unexplored. There are a couple of Python exploratory data analysis projects posted, but nothing using R, and only one 7-day forecast using a simple SARIMAX model.
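Quick note before the code: everything below assumes roughly this package setup (my custom plot themes like theme_dark_grey() are in the complete code on GitHub, linked at the end):
library(tidyverse)          # read_csv, dplyr, ggplot2, tidyr, stringr
library(tidymodels)         # recipes, parsnip, workflows, rsample, tune
library(timetk)             # time-series EDA & feature engineering
library(modeltime)          # forecasting framework
library(modeltime.ensemble) # ensemble_weighted(), used at the end
library(lubridate)
library(scales)             # dollar_format(), dollar()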
df <- read_csv("train.csv")
glimpse(df)
## Rows: 9,800
## Columns: 18
## $ `Row ID` <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ `Order ID` <chr> "CA-2017-152156", "CA-2017-152156", "CA-2017-138…
## $ `Order Date` <chr> "08/11/2017", "08/11/2017", "12/06/2017", "11/10…
## $ `Ship Date` <chr> "11/11/2017", "11/11/2017", "16/06/2017", "18/10…
## $ `Ship Mode` <chr> "Second Class", "Second Class", "Second Class", …
## $ `Customer ID` <chr> "CG-12520", "CG-12520", "DV-13045", "SO-20335", …
## $ `Customer Name` <chr> "Claire Gute", "Claire Gute", "Darrin Van Huff",…
## $ Segment <chr> "Consumer", "Consumer", "Corporate", "Consumer",…
## $ Country <chr> "United States", "United States", "United States…
## $ City <chr> "Henderson", "Henderson", "Los Angeles", "Fort L…
## $ State <chr> "Kentucky", "Kentucky", "California", "Florida",…
## $ `Postal Code` <dbl> 42420, 42420, 90036, 33311, 33311, 90032, 90032,…
## $ Region <chr> "South", "South", "West", "South", "South", "Wes…
## $ `Product ID` <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-LA-10…
## $ Category <chr> "Furniture", "Furniture", "Office Supplies", "Fu…
## $ `Sub-Category` <chr> "Bookcases", "Chairs", "Labels", "Tables", "Stor…
## $ `Product Name` <chr> "Bush Somerset Collection Bookcase", "Hon Deluxe…
## $ Sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, …
After a little digging, a few things become obvious. There are two date variables, but let’s only use Order Date going forward. The Order Date format appears to be day/month/year. They only ship within the United States, and we don’t need individual order ID numbers. Let’s clean up the column names, parse the dates, convert the categorical variables to factors, and remove what we definitely don’t need.
data_clean_tbl <- janitor::clean_names(df) %>%
as_tibble() %>%
mutate(order_date = lubridate::dmy(order_date),
across(customer_id:product_name, .fns = as.factor)) %>%
select(-c(ship_date, country, order_id, ship_mode))
glimpse(data_clean_tbl)
## Rows: 9,800
## Columns: 14
## $ row_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ order_date <date> 2017-11-08, 2017-11-08, 2017-06-12, 2016-10-11, 2…
## $ customer_id <fct> CG-12520, CG-12520, DV-13045, SO-20335, SO-20335, …
## $ customer_name <fct> Claire Gute, Claire Gute, Darrin Van Huff, Sean O'…
## $ segment <fct> Consumer, Consumer, Corporate, Consumer, Consumer,…
## $ city <fct> Henderson, Henderson, Los Angeles, Fort Lauderdale…
## $ state <fct> Kentucky, Kentucky, California, Florida, Florida, …
## $ postal_code <fct> 42420, 42420, 90036, 33311, 33311, 90032, 90032, 9…
## $ region <fct> South, South, West, South, South, West, West, West…
## $ product_id <fct> FUR-BO-10001798, FUR-CH-10000454, OFF-LA-10000240,…
## $ category <fct> Furniture, Furniture, Office Supplies, Furniture, …
## $ sub_category <fct> Bookcases, Chairs, Labels, Tables, Storage, Furnis…
## $ product_name <fct> "Bush Somerset Collection Bookcase", "Hon Deluxe F…
## $ sales <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3680, 48…
What period of time does this data cover?
t <- summarise_by_time(.data = data_clean_tbl,
.date_var = order_date)
(diff_t <- difftime(last(t$order_date),
first(t$order_date),
units = 'weeks'))
## Time difference of 208.1429 weeks
lubridate::dweeks(208.1429)
## [1] "125884825.92s (~3.99 years)"
This is a daily dataset spanning just under 4 years from 2015-01-03 to 2018-12-30. There are gaps though!
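One quick way to confirm those gaps (a sketch off the daily summary t from above): date differences between consecutive rows larger than 1 day mean missing days.
t %>%
  mutate(days_since_prev = as.numeric(order_date - lag(order_date))) %>%
  filter(days_since_prev > 1) %>%
  nrow() # number of gaps in the daily series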
Let’s group by region/state to see where the ‘Superstore’ does business, looking at total orders and total sales.
q <- data_clean_tbl %>%
group_by(region,state) %>%
summarise(orders = n(), sales = sum(sales)) %>%
ungroup() %>%
mutate(state = tidytext::reorder_within(state, orders, region)) %>%
arrange(desc(sales))
ggplot(q, aes(reorder_within(state, sales, region, sep = ""), sales, fill = region)) +
geom_col(show.legend = FALSE, alpha = 0.9) +
facet_grid(region ~., scales = 'free', space = 'free') +
coord_flip() +
tidytext::scale_x_reordered() + # gets rid of the ___region
scale_y_continuous(limits = c(0,500000), expand = c(0,0), labels=dollar_format(prefix="$")) +
theme_dark_grey() +
scale_color_todd_dark_subtle() +
scale_fill_todd_dark() +
theme(panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank()) +
geom_hline(yintercept = c(100000,200000,300000,400000),
color = "#FFCF6A",
alpha = 0.1,
size = 1) +
labs(x = element_blank(), title = 'Sales by State/Region') +
geom_text(aes(label = dollar(round((sales)))),
color="white", size=3.5, fontface = 'bold',
hjust = -0.1, vjust = 0.4)
Looks like they do business throughout the US, with the most business coming from California. They probably do most of their business through online shopping, as these total sales numbers aren’t high enough to support multiple warehouses spread across the country.
Let’s also look at the number of orders and sales grouped by customer_name, category, and segment. Any outliers?
# Want to identify any outlier customers, and see if any category/segment of their business dominates
x <- data_clean_tbl %>%
group_by(customer_name, category, segment) %>%
summarise(sales = sum(sales),
orders = n()) %>%
mutate(log_sales = log(sales))
z <- data_clean_tbl %>%
group_by(category, segment) %>%
summarise(sales = sum(sales),
orders = n()) %>%
ungroup() %>%
mutate(orders_pct = round(prop.table(orders) * 100),
sales_pct = round(prop.table(sales) * 100),
sales_per_order = round(sales/orders)) %>%
pivot_longer(orders_pct:sales_pct,
names_to = 'metric',
values_to = 'percent')
outlier_furniture_consumer <- x %>% filter(segment == 'Consumer',
category == "Furniture") %>%
arrange(desc(orders)) %>% head()
outlier_technology_corporate_homeoffice <- x %>%
filter(segment == 'Corporate' | segment == "Home Office",
category == "Technology") %>%
arrange(desc(log_sales)) %>% head()
# Make specific tbl for outlier Seth Vernon so label doesn't appear in other facets
seth_vernon <- data.frame(category = factor("Furniture", levels = c("Furniture","Office Supplies","Technology")),
segment = factor("Consumer", levels = c("Consumer","Corporate","Home Office")))
ggplot(x, aes(log(sales), orders, color = category)) +
geom_jitter(aes(shape = segment),
alpha = 0.7,
width = 0,
height = 0.3,
show.legend = FALSE,
size = 2) +
theme_dark_grey() +
scale_color_manual(values = c("#fbb4ae", "#b3cde3", "#fed9a6")) +
facet_grid(segment~category, scales = 'free_y') +
labs(title = "Orders & Sales in log($) per Customer by Category/Segment",
x = 'log($)') +
geom_curve(aes(x = 7, xend = 8.9, y = 18, yend = 16),
data = seth_vernon,
curvature = -0.5,
size = 1,
arrow = arrow(length = unit(2, "mm")),
color = 'white',
alpha = 0.7) +
geom_text(aes(x = 6.5, y = 15),
data = seth_vernon ,
label = "Seth Vernon\n15 Orders\nTotal = $8,332",
size = 3.5,
show.legend = F,
color = 'white',
alpha = 0.7)
Their sales are spread pretty evenly between these categories. More orders are for office supplies, but their largest sales are for technology; not surprising. They have some loyal customers too, like Seth Vernon.
Now let’s focus on sales within this time-series. If we want to forecast sales, we can’t use any variable other than order_date. After 2018-12-30 we don’t have any information about the number of orders or where those orders are coming from. We could build models using, for example, orders as an independent variable to predict sales, but only until 2018-12-30; after that, the models break down since we don’t have values for orders to predict sales with.
This is where we really start using {timetk} and {modeltime}. Hopefully you can appreciate how well they work alongside the {tidyverse} and {tidymodels} packages. No more learning a unique process for every type of model you want to build.
sales_daily_tbl <- data_clean_tbl %>%
summarise_by_time(.date_var = order_date,
.by = 'day',
sales = sum(sales))
Let’s look at the daily total sales for this company over the time-series.
sales_daily_tbl %>%
plot_time_series(.date_var = order_date,
.value = sales,
.interactive = FALSE,
.line_color = "#cccccc",
.smooth_color = "#db0000",
.title = "Daily Total Sales",
.y_lab = "$") +
theme_dark_grey()
There are no days with $0 in sales on the plot, but in our EDA we noticed gaps in the data. The data spans just under 4 years, yet we only have 1,230 rows of daily data. With a row for every day, we should have about 365 * 4 = 1,460 rows. We need to pad the missing days so our time-series is complete, and we can fill sales with 0 on those missing days.
sales_daily_pad_tbl <- sales_daily_tbl %>%
pad_by_time(order_date,
.by = 'day',
.pad_value = 0)
glimpse(sales_daily_pad_tbl)
## Rows: 1,458
## Columns: 2
## $ order_date <date> 2015-01-03, 2015-01-04, 2015-01-05, 2015-01-06, 2015-01-0…
## $ sales <dbl> 16.448, 288.060, 19.536, 4407.100, 87.158, 0.000, 40.544, …
1,458 rows, great! Now we have a full 4-year daily dataset with no missing dates.
Our sales variable ranges from 0 to ~$30,000 on any given day, so we’ll need to transform this continuous variable. Since 0s can mess up our machine learning models, we’ll use a log-interval transformation with an offset of 1, followed by standardization to center the data around 0. We could do this in the recipe stage of each workflow, but we can do it once here and be done with it. Let’s also rename sales to sales_trans.
sales_daily_pad_trans_tbl <- sales_daily_pad_tbl %>%
mutate(sales = log_interval_vec(sales,
limit_lower = 0,
offset = 1)) %>%
mutate(sales = standardize_vec(sales)) %>%
rename(sales_trans = sales)
We need to keep track of the transformation parameters, which log_interval_vec() and standardize_vec() print to the console, for when we un-transform the data at the end.
limit_lower <- 0
limit_upper <- 30918.3876
offset <- 1
std_mean_sales <- -4.61074359571939
std_sd_sales <- 2.86059320652223
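As a sanity check, inverting both transformations with these saved parameters should recover the original padded sales; a sketch using timetk’s inverse functions (differences should be near zero, up to rounding of the parameters above):
sales_daily_pad_trans_tbl %>%
  mutate(sales = standardize_inv_vec(sales_trans,
                                     mean = std_mean_sales,
                                     sd = std_sd_sales),
         sales = log_interval_inv_vec(sales,
                                      limit_lower = limit_lower,
                                      limit_upper = limit_upper,
                                      offset = offset)) %>%
  summarise(max_diff = max(abs(sales - sales_daily_pad_tbl$sales)))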
Let’s look at our new padded & transformed time-series.
sales_daily_pad_trans_tbl %>%
plot_time_series(order_date,
sales_trans,
.interactive = FALSE,
.line_color = "#cccccc",
.smooth_color = "#db0000",
.title = "Daily Total Sales, Padded & Transformed",
.y_lab = "Sales") +
theme_dark_grey()
We might want to add lags, rolling lags, and/or Fourier-series features that could help our forecast. This is a way to add features to regress on for the future dates we want to forecast; currently those future dates have only order_date to predict with.
Let’s look at ACF/PACF plots to see if any periods look usable.
### ACF/PACF
lag_labels <- data.frame(name = c("ACF","ACF","ACF","ACF","ACF","PACF","PACF","PACF","PACF"),
label = c('7','14','21','28','357','7','14','21','28'))
sales_daily_pad_trans_tbl %>%
plot_acf_diagnostics(.date_var = order_date, .value = sales_trans, .lags = 400,
.show_white_noise_bars = TRUE, .interactive = FALSE, .line_color = "#cccccc",
.line_size = 0.5, .white_noise_line_color = "#db0000", .point_color = "#fed9a6",
.point_size = 0.8) +
theme_dark_grey() +
geom_text(aes(label = label),
data = lag_labels,
x = c(7, 14, 21.7, 29.5, 358, 7, 15, 22, 30),
y = c(.53, .52, .5, .5, .38,.5, .35, .27, .25),
size = 4,
color = "white",
angle = 55)
Clearly there is a strong weekly correlation between sales and sales 7, 14, 21, … days later. The partial autocorrelation function (PACF) de-weights each lag’s correlation based on the lags that come before it. Our PACF shows lags 7, 14, 21, and 28 as potentially important. All other lags can be considered white noise or close to white noise.
I experimented with multiple lag and Fourier-Series combos. A lag the length of my forecast (84 days) with rolling 30d, 60d, and 90d averages helped. Additional lags and Fourier-Series did not improve model accuracy. It’s possible I didn’t use them correctly, or experiment enough. I’ll include multiple Fourier-Series with 3 orders each to show the process.
forecast_3_month <- 84
lag_period <- 84 # the smallest lag that reaches our 84-day forecast horizon
rolling_periods <- c(30, 60, 90) # 1-, 2-, and 3-month moving averages as features to catch the trend
full_data_prepared_tbl <- sales_daily_pad_trans_tbl %>%
future_frame(.date_var = order_date,
.length_out = forecast_3_month,
.bind_data = TRUE) %>%
tk_augment_fourier(.date_var = order_date,
.periods = c(7,14,21,28,357),
.K = 3) %>%
tk_augment_lags(.value = sales_trans,
.lags = lag_period ) %>%
tk_augment_slidify(.value = sales_trans_lag84,
.f = ~ mean(.x, na.rm = TRUE),
.period = rolling_periods,
.partial = TRUE,
.align = 'center') %>%
rowid_to_column(var = 'rowid') %>%
rename_with(.cols = contains('lag'),
.fn = ~ str_c('lag_', .)) %>%
rename_with(.cols = matches('(_sin)|(_cos)'),
.fn = ~ str_c('fourier_', .))
glimpse(full_data_prepared_tbl)
## Rows: 1,542
## Columns: 37
## $ rowid <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,…
## $ order_date <date> 2015-01-03, 2015-01-04, 2015-01-0…
## $ sales_trans <dbl> -1.002789645, -0.018295287, -0.945…
## $ fourier_order_date_sin7_K1 <dbl> 9.749279e-01, 4.338837e-01, -4.338…
## $ fourier_order_date_cos7_K1 <dbl> -0.2225209, -0.9009689, -0.9009689…
## $ fourier_order_date_sin7_K2 <dbl> -4.338837e-01, -7.818315e-01, 7.81…
## $ fourier_order_date_cos7_K2 <dbl> -0.9009689, 0.6234898, 0.6234898, …
## $ fourier_order_date_sin7_K3 <dbl> -7.818315e-01, 9.749279e-01, -9.74…
## $ fourier_order_date_cos7_K3 <dbl> 0.6234898, -0.2225209, -0.2225209,…
## $ fourier_order_date_sin14_K1 <dbl> 7.818315e-01, 9.749279e-01, 9.7492…
## $ fourier_order_date_cos14_K1 <dbl> 0.6234898, 0.2225209, -0.2225209, …
## $ fourier_order_date_sin14_K2 <dbl> 9.749279e-01, 4.338837e-01, -4.338…
## $ fourier_order_date_cos14_K2 <dbl> -0.2225209, -0.9009689, -0.9009689…
## $ fourier_order_date_sin14_K3 <dbl> 4.338837e-01, -7.818315e-01, -7.81…
## $ fourier_order_date_cos14_K3 <dbl> -0.9009689, -0.6234898, 0.6234898,…
## $ fourier_order_date_sin21_K1 <dbl> -9.972038e-01, -9.308737e-01, -7.8…
## $ fourier_order_date_cos21_K1 <dbl> 0.07473009, 0.36534102, 0.62348980…
## $ fourier_order_date_sin21_K2 <dbl> -1.490423e-01, -6.801727e-01, -9.7…
## $ fourier_order_date_cos21_K2 <dbl> -0.98883083, -0.73305187, -0.22252…
## $ fourier_order_date_sin21_K3 <dbl> 9.749279e-01, 4.338837e-01, -4.338…
## $ fourier_order_date_cos21_K3 <dbl> -0.2225209, -0.9009689, -0.9009689…
## $ fourier_order_date_sin28_K1 <dbl> 4.338837e-01, 6.234898e-01, 7.8183…
## $ fourier_order_date_cos28_K1 <dbl> 9.009689e-01, 7.818315e-01, 6.2348…
## $ fourier_order_date_sin28_K2 <dbl> 7.818315e-01, 9.749279e-01, 9.7492…
## $ fourier_order_date_cos28_K2 <dbl> 0.6234898, 0.2225209, -0.2225209, …
## $ fourier_order_date_sin28_K3 <dbl> 9.749279e-01, 9.009689e-01, 4.3388…
## $ fourier_order_date_cos28_K3 <dbl> 2.225209e-01, -4.338837e-01, -9.00…
## $ fourier_order_date_sin357_K1 <dbl> 0.2778924, 0.2947552, 0.3115267, 0…
## $ fourier_order_date_cos357_K1 <dbl> 0.9606122, 0.9555728, 0.9502374, 0…
## $ fourier_order_date_sin357_K2 <dbl> 0.5338936, 0.5633201, 0.5920486, 0…
## $ fourier_order_date_cos357_K2 <dbl> 0.8455517, 0.8262388, 0.8059022, 0…
## $ fourier_order_date_sin357_K3 <dbl> 0.7478370, 0.7818315, 0.8136468, 0…
## $ fourier_order_date_cos357_K3 <dbl> 0.66388234, 0.62348980, 0.58135949…
## $ lag_sales_trans_lag84 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ lag_sales_trans_lag84_roll_30 <dbl> NaN, NaN, NaN, NaN, NaN, NaN, NaN,…
## $ lag_sales_trans_lag84_roll_60 <dbl> NaN, NaN, NaN, NaN, NaN, NaN, NaN,…
## $ lag_sales_trans_lag84_roll_90 <dbl> NaN, NaN, NaN, NaN, NaN, NaN, NaN,…
The time-series now includes lag, rolling-lag, and Fourier-series features for the future forecast dates. If these features help model accuracy on the training/testing sets, we can assume they’ll help the models forecast future sales. They may or may not help though. Gotta experiment. Let’s look at the lag and rolling-lag data.
full_data_prepared_tbl %>%
pivot_longer(c(sales_trans,
lag_sales_trans_lag84,
lag_sales_trans_lag84_roll_30,
lag_sales_trans_lag84_roll_60,
lag_sales_trans_lag84_roll_90)) %>%
plot_time_series(order_date, value, name, .interactive = FALSE, .smooth = FALSE, .title = "30d, 60d, 90d Rolling Lags") +
theme_dark_grey() +
scale_color_manual(values = c("#cccccc","#db0000", "#0099ff", "#00ff99", "#ffffcc"))
Notice each of the first 84 days lacks at least some of the lag/rolling-lag features. We can’t use those days in our training/testing splits; that’s the downside. We could probably impute those dates somehow (one option is sketched below), but I won’t pursue it here.
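For example (a sketch, not used in this post), tidyr::fill() could backfill those early rows, since it should treat both the NA lags and the NaN rolling values as missing:
full_data_prepared_tbl %>%
  fill(contains("lag"), .direction = "up")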
The full_data_prepared_tbl has 1,542 rows (1,458 original days + 84 future days). We need to separate the rows for the future days so we can split the original data into training/testing sets, filtering on the rows that lack sales_trans values.
We can use {skimr} to show us a summary of our data.
forecast_tbl <- full_data_prepared_tbl %>%
filter(is.na(sales_trans))
skimr::skim_without_charts(forecast_tbl)
| Name | forecast_tbl |
| Number of rows | 84 |
| Number of columns | 37 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| numeric | 36 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| order_date | 0 | 1 | 2018-12-31 | 2019-03-24 | 2019-02-10 | 84 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| rowid | 0 | 1 | 1500.50 | 24.39 | 1459.00 | 1479.75 | 1500.50 | 1521.25 | 1542.00 |
| sales_trans | 84 | 0 | NaN | NA | NA | NA | NA | NA | NA |
| fourier_order_date_sin7_K1 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos7_K1 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin7_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos7_K2 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin7_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos7_K3 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin14_K1 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos14_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.62 | 1.00 |
| fourier_order_date_sin14_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos14_K2 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin14_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos14_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.62 | 1.00 |
| fourier_order_date_sin21_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.68 | 0.00 | 0.68 | 1.00 |
| fourier_order_date_cos21_K1 | 0 | 1 | 0.00 | 0.71 | -0.99 | -0.73 | 0.07 | 0.62 | 1.00 |
| fourier_order_date_sin21_K2 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.68 | 0.00 | 0.68 | 1.00 |
| fourier_order_date_cos21_K2 | 0 | 1 | 0.00 | 0.71 | -0.99 | -0.73 | 0.07 | 0.62 | 1.00 |
| fourier_order_date_sin21_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos21_K3 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin28_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.66 | 0.00 | 0.66 | 1.00 |
| fourier_order_date_cos28_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.66 | 0.00 | 0.66 | 1.00 |
| fourier_order_date_sin28_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos28_K2 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.62 | 1.00 |
| fourier_order_date_sin28_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.66 | 0.00 | 0.66 | 1.00 |
| fourier_order_date_cos28_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.66 | 0.00 | 0.66 | 1.00 |
| fourier_order_date_sin357_K1 | 0 | 1 | 0.91 | 0.08 | 0.72 | 0.85 | 0.93 | 0.98 | 1.00 |
| fourier_order_date_cos357_K1 | 0 | 1 | 0.03 | 0.41 | -0.64 | -0.33 | 0.03 | 0.39 | 0.69 |
| fourier_order_date_sin357_K2 | 0 | 1 | 0.04 | 0.69 | -0.99 | -0.62 | 0.06 | 0.71 | 1.00 |
| fourier_order_date_cos357_K2 | 0 | 1 | -0.67 | 0.28 | -1.00 | -0.93 | -0.74 | -0.45 | -0.05 |
| fourier_order_date_sin357_K3 | 0 | 1 | -0.36 | 0.52 | -1.00 | -0.85 | -0.45 | 0.09 | 0.65 |
| fourier_order_date_cos357_K3 | 0 | 1 | -0.03 | 0.78 | -1.00 | -0.85 | -0.09 | 0.84 | 1.00 |
| lag_sales_trans_lag84 | 0 | 1 | 0.53 | 0.76 | -2.00 | 0.29 | 0.71 | 1.00 | 1.60 |
| lag_sales_trans_lag84_roll_30 | 0 | 1 | 0.54 | 0.16 | 0.22 | 0.46 | 0.57 | 0.64 | 0.81 |
| lag_sales_trans_lag84_roll_60 | 0 | 1 | 0.56 | 0.07 | 0.43 | 0.51 | 0.56 | 0.61 | 0.66 |
| lag_sales_trans_lag84_roll_90 | 0 | 1 | 0.55 | 0.06 | 0.45 | 0.51 | 0.53 | 0.61 | 0.66 |
The forecast_tbl has 84 rows and 37 columns, covering the 84 future dates from 2018-12-31 through 2019-03-24. Rows are only missing sales_trans values. Good to go.
The rest of the data we’ll call the data_prepared_tbl. We need to drop those rows that lack any lag/rolling lag data.
data_prepared_tbl <- full_data_prepared_tbl %>%
filter(!is.na(sales_trans)) %>%
drop_na()
skimr::skim_without_charts(data_prepared_tbl)
| Name | data_prepared_tbl |
| Number of rows | 1374 |
| Number of columns | 37 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| numeric | 36 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| order_date | 0 | 1 | 2015-03-28 | 2018-12-30 | 2017-02-11 | 1374 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| rowid | 0 | 1 | 771.50 | 396.78 | 85.00 | 428.25 | 771.50 | 1114.75 | 1458.00 |
| sales_trans | 0 | 1 | 0.04 | 0.98 | -2.00 | -0.25 | 0.37 | 0.70 | 1.75 |
| fourier_order_date_sin7_K1 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos7_K1 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin7_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos7_K2 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin7_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos7_K3 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin14_K1 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos14_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.22 | 0.62 | 1.00 |
| fourier_order_date_sin14_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos14_K2 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin14_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos14_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin21_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.68 | 0.00 | 0.68 | 1.00 |
| fourier_order_date_cos21_K1 | 0 | 1 | 0.00 | 0.71 | -0.99 | -0.73 | 0.07 | 0.62 | 1.00 |
| fourier_order_date_sin21_K2 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.68 | 0.00 | 0.68 | 1.00 |
| fourier_order_date_cos21_K2 | 0 | 1 | 0.00 | 0.71 | -0.99 | -0.73 | 0.07 | 0.62 | 1.00 |
| fourier_order_date_sin21_K3 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos21_K3 | 0 | 1 | 0.00 | 0.71 | -0.90 | -0.90 | -0.22 | 0.62 | 1.00 |
| fourier_order_date_sin28_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.62 | 1.00 |
| fourier_order_date_cos28_K1 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.78 | 1.00 |
| fourier_order_date_sin28_K2 | 0 | 1 | 0.00 | 0.71 | -0.97 | -0.78 | 0.00 | 0.78 | 0.97 |
| fourier_order_date_cos28_K2 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.22 | 0.62 | 1.00 |
| fourier_order_date_sin28_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.78 | 1.00 |
| fourier_order_date_cos28_K3 | 0 | 1 | 0.00 | 0.71 | -1.00 | -0.62 | 0.00 | 0.62 | 1.00 |
| fourier_order_date_sin357_K1 | 0 | 1 | -0.04 | 0.70 | -1.00 | -0.73 | -0.06 | 0.64 | 1.00 |
| fourier_order_date_cos357_K1 | 0 | 1 | -0.01 | 0.72 | -1.00 | -0.73 | -0.05 | 0.73 | 1.00 |
| fourier_order_date_sin357_K2 | 0 | 1 | -0.02 | 0.71 | -1.00 | -0.73 | -0.04 | 0.69 | 1.00 |
| fourier_order_date_cos357_K2 | 0 | 1 | 0.03 | 0.70 | -1.00 | -0.67 | 0.06 | 0.73 | 1.00 |
| fourier_order_date_sin357_K3 | 0 | 1 | 0.02 | 0.71 | -1.00 | -0.69 | 0.03 | 0.73 | 1.00 |
| fourier_order_date_cos357_K3 | 0 | 1 | 0.02 | 0.71 | -1.00 | -0.68 | 0.04 | 0.74 | 1.00 |
| lag_sales_trans_lag84 | 0 | 1 | -0.03 | 1.00 | -2.00 | -0.39 | 0.33 | 0.67 | 2.42 |
| lag_sales_trans_lag84_roll_30 | 0 | 1 | -0.03 | 0.33 | -1.02 | -0.23 | 0.00 | 0.21 | 0.64 |
| lag_sales_trans_lag84_roll_60 | 0 | 1 | -0.03 | 0.29 | -0.83 | -0.21 | 0.01 | 0.18 | 0.53 |
| lag_sales_trans_lag84_roll_90 | 0 | 1 | -0.03 | 0.26 | -0.83 | -0.18 | 0.00 | 0.16 | 0.46 |
1374 rows, 37 columns, no missing data. Dates now range from 2015-03-28 to 2018-12-30. Good to go.
I decided to split training/testing so we’re testing on the last 84 days, the same number of days as our forecast_tbl. I could have split differently. Cross-validation for sequential models needs ordered data, so making the testing set too big limits the benefit of time-series cross-validation. That’ll make sense soon.
splits <- data_prepared_tbl %>%
time_series_split(date_var = order_date,
assess = 84,
cumulative = TRUE)
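A quick check that the split behaves as expected: testing() should hold exactly the last 84 days, leaving 1,290 days for training.
nrow(training(splits)) # 1290
nrow(testing(splits))  # 84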
splits %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(.date_var = order_date,
.value = sales_trans,
.smooth = FALSE,
.interactive = FALSE,
.title = "Initial Cross-Validation Plan",
.y_lab = "Sales") +
theme_dark_grey() +
scale_color_manual(values = c("#cccccc", "#db0000"))
{modeltime} was built to complement {tidymodels}, so the recipes, workflows, hyperparameter tuning, etc. all work the same.
I’m going to make two recipes for the modeling workflows: one that includes the Fourier-series features and one that leaves them out. I could have done this differently, but it works.
The step_timeseries_signature() function adds many time-series features based on our date variable, order_date. We just have to remove the features that don’t apply to a daily dataset, like anything involving minutes, hours, etc. I also added steps to normalize the numeric variables, one-hot encode the nominal variables, and flag US holidays.
recipe_spec <- recipe(sales_trans ~ ., data = training(splits)) %>%
step_timeseries_signature(order_date) %>%
update_role(rowid, new_role = 'indicator') %>%
step_rm(matches("(.iso)|(xts)|(hour)|(minute)|(second)|(am.pm)")) %>%
step_normalize(matches('(index.num)|(year)|(yday)|(qday)|(mday)|(date_day)')) %>%
step_dummy(all_nominal(), one_hot = TRUE) %>%
step_holiday(order_date, holidays = timeDate::listHolidays("US"))
recipe_spec uses all the Fourier-Series found in our data_prepared_tbl.
recipe_spec_no_f <- recipe(sales_trans ~ order_date
+ lag_sales_trans_lag84
+ lag_sales_trans_lag84_roll_30
+ lag_sales_trans_lag84_roll_60
+ lag_sales_trans_lag84_roll_90,
data = training(splits)) %>%
step_timeseries_signature(order_date) %>%
step_rm(matches("(.iso)|(xts)|(hour)|(minute)|(second)|(am.pm)")) %>%
step_normalize(matches('(index.num)|(year)|(yday)|(qday)|(mday)|(date_day)')) %>%
step_dummy(all_nominal(), one_hot = TRUE) %>%
step_holiday(order_date, holidays = timeDate::listHolidays("US"))
recipe_spec_no_f does NOT use the Fourier-Series found in our data_prepared_tbl.
recipe_spec_no_f %>% prep() %>% juice() %>% glimpse()
## Rows: 1,290
## Columns: 58
## $ order_date <date> 2015-03-28, 2015-03-29, 2015…
## $ lag_sales_trans_lag84 <dbl> -1.002789645, -0.018295287, -…
## $ lag_sales_trans_lag84_roll_30 <dbl> -0.6456698, -0.6031017, -0.52…
## $ lag_sales_trans_lag84_roll_60 <dbl> -0.6811990, -0.6723552, -0.66…
## $ lag_sales_trans_lag84_roll_90 <dbl> -0.7196151, -0.7196938, -0.74…
## $ sales_trans <dbl> 0.56999672, 0.38250608, 0.481…
## $ order_date_index.num <dbl> -1.730038, -1.727353, -1.7246…
## $ order_date_year <dbl> -1.419664, -1.419664, -1.4196…
## $ order_date_half <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ order_date_quarter <int> 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,…
## $ order_date_month <int> 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,…
## $ order_date_day <dbl> 1.38891404, 1.50196926, 1.615…
## $ order_date_wday <int> 7, 1, 2, 3, 4, 5, 6, 7, 1, 2,…
## $ order_date_mday <dbl> 1.38891404, 1.50196926, 1.615…
## $ order_date_qday <dbl> 1.5404251, 1.5780733, 1.61572…
## $ order_date_yday <dbl> -0.9656001, -0.9555644, -0.94…
## $ order_date_mweek <int> 4, 4, 5, 5, 5, 1, 1, 1, 1, 2,…
## $ order_date_week <int> 13, 13, 13, 13, 13, 14, 14, 1…
## $ order_date_week2 <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ order_date_week3 <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
## $ order_date_week4 <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
## $ order_date_mday7 <dbl> 1.6685116, 1.6685116, 1.66851…
## $ order_date_month.lbl_01 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_month.lbl_02 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_month.lbl_03 <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,…
## $ order_date_month.lbl_04 <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,…
## $ order_date_month.lbl_05 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_month.lbl_06 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_month.lbl_07 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_month.lbl_08 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_month.lbl_09 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_month.lbl_10 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_month.lbl_11 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_month.lbl_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_wday.lbl_1 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ order_date_wday.lbl_2 <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
## $ order_date_wday.lbl_3 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ order_date_wday.lbl_4 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ order_date_wday.lbl_5 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ order_date_wday.lbl_6 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ order_date_wday.lbl_7 <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ order_date_USChristmasDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USColumbusDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USCPulaskisBirthday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USDecorationMemorialDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USElectionDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USGoodFriday <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ order_date_USInaugurationDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USIndependenceDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USLaborDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USLincolnsBirthday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USMemorialDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USMLKingsBirthday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USNewYearsDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USPresidentsDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USThanksgivingDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USVeteransDay <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ order_date_USWashingtonsBirthday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
Even without Fourier-Series, look at all the features that may help us model future sales. We started with just order_date.
We don’t know which algorithms will work best with our data to predict sales, so let’s train and test a bunch of models. Building workflows makes it easy to move from one model to the next; notice how many different models I tested while the code chunks stay nearly identical. Experimenting becomes easy and limited only by your CPUs.
Let’s build. Certain models can use a date object like order_date as a predictor. Some models CANNOT, and for those we’ll predict on all variables in our recipes EXCEPT order_date.
Models that use a date-object predictor: ARIMA, ARIMA_boost, Neural Network Autoregression (NNETAR), Prophet, and Prophet_boost.
Models that CANNOT use a date-object predictor: XGBoost, Support-Vector Machine (SVM), Random Forest (RF), Neural Network (NNET), Multivariate Adaptive Regression Splines (MARS), and Elastic-Net Regularized Generalized Linear Models (GLMNET).
OK, let’s build!
Only model w/out a workflow
set.seed(321)
model_fit_arima <- arima_reg() %>%
set_engine('auto_arima') %>%
fit(sales_trans ~ order_date, data = training(splits))
Keep order_date feature
wflw_fit_arimaboost <- workflow() %>%
add_model(spec = arima_boost() %>% set_engine('auto_arima_xgboost')) %>%
add_recipe(recipe_spec) %>%
fit(training(splits))
wflw_fit_arimaboost_no_f <- workflow() %>%
add_model(spec = arima_boost() %>% set_engine('auto_arima_xgboost')) %>%
add_recipe(recipe_spec_no_f) %>%
fit(training(splits))
Keep order_date feature
wflw_fit_prophet <- workflow() %>%
add_model(spec = prophet_reg() %>% set_engine('prophet')) %>%
add_recipe(recipe_spec) %>%
fit(training(splits))
wflw_fit_prophet_no_f <- workflow() %>%
add_model(spec = prophet_reg() %>% set_engine('prophet')) %>%
add_recipe(recipe_spec_no_f) %>%
fit(training(splits))
Set order_date to ‘indicator’
wflw_fit_xgboost <- workflow() %>%
add_model(spec = boost_tree(mode = 'regression') %>% set_engine('xgboost')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_xgboost_no_f <- workflow() %>%
add_model(spec = boost_tree(mode = 'regression') %>% set_engine('xgboost')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
Keep order_date feature
wflw_fit_prophet_xgboost <- workflow() %>%
add_model(spec = prophet_boost(seasonality_daily = FALSE,
seasonality_weekly = FALSE,
seasonality_yearly = FALSE) %>% set_engine('prophet_xgboost')) %>%
add_recipe(recipe_spec) %>%
fit(training(splits))
wflw_fit_prophet_xgboost_no_f <- workflow() %>%
add_model(spec = prophet_boost(seasonality_daily = FALSE,
seasonality_weekly = FALSE,
seasonality_yearly = FALSE) %>% set_engine('prophet_xgboost')) %>%
add_recipe(recipe_spec_no_f) %>%
fit(training(splits))
I turned the Prophet seasonalities off so Prophet only models the trend. XGBoost will then model the seasonality on the Prophet model’s residuals using the calendar features from the recipe.
Set order_date to ‘indicator’
set.seed(321)
wflw_fit_svm <- workflow() %>%
add_model(spec = svm_rbf(mode = 'regression') %>% set_engine('kernlab')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_svm_no_f <- workflow() %>%
add_model(spec = svm_rbf(mode = 'regression') %>% set_engine('kernlab')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
Set order_date to ‘indicator’
set.seed(321)
wflw_fit_rf <- workflow() %>%
add_model(spec = rand_forest(mode = 'regression') %>% set_engine('ranger')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_rf_no_f <- workflow() %>%
add_model(spec = rand_forest(mode = 'regression') %>% set_engine('ranger')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
Set order_date to ‘indicator’
set.seed(321)
wflw_fit_nnet <- workflow() %>%
add_model(spec = mlp(mode = 'regression') %>% set_engine('nnet')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_nnet_no_f <- workflow() %>%
add_model(spec = mlp(mode = 'regression') %>% set_engine('nnet')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
Keep order_date feature
set.seed(321)
wflw_fit_nnetar <- workflow() %>%
add_model(nnetar_reg() %>% set_engine('nnetar')) %>%
add_recipe(recipe_spec) %>%
fit(training(splits))
wflw_fit_nnetar_no_f <- workflow() %>%
add_model(nnetar_reg() %>% set_engine('nnetar')) %>%
add_recipe(recipe_spec_no_f) %>%
fit(training(splits))
Set order_date to ‘indicator’
wflw_fit_mars <- workflow() %>%
add_model(spec = mars(mode = 'regression') %>% set_engine('earth')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_mars_no_f <- workflow() %>%
add_model(spec = mars(mode = 'regression') %>% set_engine('earth')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
Set order_date to ‘indicator’
wflw_fit_glmnet <- workflow() %>%
add_model(spec = linear_reg(mode = 'regression',
penalty = 0.1,
mixture = 0.5) %>% set_engine('glmnet')) %>%
add_recipe(recipe_spec %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
wflw_fit_glmnet_no_f <- workflow() %>%
add_model(spec = linear_reg(mode = 'regression',
penalty = 0.1,
mixture = 0.5) %>% set_engine('glmnet')) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date, new_role = 'indicator')) %>%
fit(training(splits))
That’s a lot of code, but also lots of copy/pasting followed by small changes. It’s really not that bad. Let’s collect each model/workflow and measure performance on the testing data.
Collect and Rename
submodels_tbl <- modeltime_table(model_fit_arima,
wflw_fit_arimaboost,
wflw_fit_arimaboost_no_f,
wflw_fit_prophet,
wflw_fit_prophet_no_f,
wflw_fit_prophet_xgboost,
wflw_fit_prophet_xgboost_no_f,
wflw_fit_svm,
wflw_fit_svm_no_f,
wflw_fit_rf,
wflw_fit_rf_no_f,
wflw_fit_nnet,
wflw_fit_nnet_no_f,
wflw_fit_nnetar,
wflw_fit_nnetar_no_f,
wflw_fit_mars,
wflw_fit_mars_no_f,
wflw_fit_glmnet,
wflw_fit_glmnet_no_f) %>%
update_model_description(1, "ARIMA(0,1,1)(0,0,2)[7]") %>%
update_model_description(2, "ARIMA_boost") %>%
update_model_description(3, "ARIMA_boost-no_fourier") %>%
update_model_description(4, "PROPHET") %>%
update_model_description(5, "PROPHET-no_fourier") %>%
update_model_description(6, "PROPHET_boost") %>%
update_model_description(7, "PROPHET_boost-no_fourier") %>%
update_model_description(8, "SVM") %>%
update_model_description(9, "SVM-no_fourier") %>%
update_model_description(10, "RandomForest") %>%
update_model_description(11, "RandomForest-no_fourier") %>%
update_model_description(12, "NNET") %>%
update_model_description(13, "NNET-no_fourier") %>%
update_model_description(14, "NNETAR") %>%
update_model_description(15, "NNETAR-no_fourier") %>%
update_model_description(16, "MARS") %>%
update_model_description(17, "MARS-no_fourier") %>%
update_model_description(18, "GLMNET") %>%
update_model_description(19, "GLMNET-no_fourier")
Calibrate on Testing Data
calibration_tbl <- submodels_tbl %>% modeltime_calibrate(testing(splits))
Measure Accuracy
calibration_tbl %>% modeltime_accuracy() %>% arrange(rmse) %>% table_modeltime_accuracy(.interactive = FALSE)
| Accuracy Table | ||||||||
|---|---|---|---|---|---|---|---|---|
| .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
| 9 | SVM-no_fourier | Test | 0.35 | 314.23 | 0.48 | 58.48 | 0.53 | 0.52 |
| 14 | NNETAR | Test | 0.38 | 310.80 | 0.52 | 61.80 | 0.54 | 0.50 |
| 8 | SVM | Test | 0.39 | 355.90 | 0.53 | 63.28 | 0.56 | 0.45 |
| 15 | NNETAR-no_fourier | Test | 0.41 | 259.44 | 0.56 | 67.29 | 0.57 | 0.46 |
| 7 | PROPHET_boost-no_fourier | Test | 0.38 | 351.48 | 0.52 | 58.60 | 0.57 | 0.44 |
| 11 | RandomForest-no_fourier | Test | 0.45 | 253.48 | 0.61 | 71.94 | 0.59 | 0.43 |
| 10 | RandomForest | Test | 0.48 | 258.79 | 0.65 | 78.93 | 0.62 | 0.39 |
| 6 | PROPHET_boost | Test | 0.42 | 309.38 | 0.57 | 65.50 | 0.63 | 0.31 |
| 17 | MARS-no_fourier | Test | 0.44 | 428.06 | 0.60 | 68.69 | 0.63 | 0.32 |
| 12 | NNET | Test | 0.48 | 260.89 | 0.65 | 75.25 | 0.64 | 0.27 |
| 5 | PROPHET-no_fourier | Test | 0.46 | 425.59 | 0.63 | 75.72 | 0.65 | 0.29 |
| 16 | MARS | Test | 0.47 | 243.98 | 0.64 | 81.43 | 0.65 | 0.29 |
| 19 | GLMNET-no_fourier | Test | 0.49 | 290.47 | 0.67 | 85.03 | 0.66 | 0.26 |
| 18 | GLMNET | Test | 0.50 | 288.36 | 0.68 | 89.65 | 0.66 | 0.26 |
| 3 | ARIMA_boost-no_fourier | Test | 0.50 | 291.62 | 0.68 | 79.44 | 0.67 | 0.21 |
| 4 | PROPHET | Test | 0.49 | 484.70 | 0.67 | 80.61 | 0.68 | 0.25 |
| 2 | ARIMA_boost | Test | 0.50 | 241.05 | 0.69 | 79.06 | 0.70 | 0.18 |
| 13 | NNET-no_fourier | Test | 0.54 | 216.05 | 0.74 | 84.72 | 0.70 | 0.24 |
| 1 | ARIMA(0,1,1)(0,0,2)[7] | Test | 0.61 | 162.54 | 0.83 | 102.57 | 0.76 | 0.13 |
We’re looking for models with low rmse and high rsq; those fit the testing data best. I want to see how just the 6 best models perform.
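You could also pull those six lowest-rmse model IDs programmatically instead of eyeballing the table; a quick sketch:
calibration_tbl %>%
  modeltime_accuracy() %>%
  arrange(rmse) %>%
  slice_head(n = 6) %>%
  pull(.model_id)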
Collect 6 Best
submodels_tbl_2 <- modeltime_table(wflw_fit_svm_no_f,
wflw_fit_nnetar_no_f,
wflw_fit_svm,
wflw_fit_prophet_xgboost_no_f,
wflw_fit_nnetar,
wflw_fit_rf_no_f) %>%
update_model_description(1, "SVM_no_f") %>%
update_model_description(2, "NNETAR_no_f") %>%
update_model_description(3, "SVM") %>%
update_model_description(4, "PROPHET_boost_no_f") %>%
update_model_description(5, "NNETAR") %>%
update_model_description(6, "RandomForest_no_f")
Visualize Best 6 Models Fit on Testing Data
calibration_tbl <- submodels_tbl_2 %>% modeltime_calibrate(testing(splits))
calibration_tbl %>%
modeltime_forecast(new_data = testing(splits),
actual_data = filter_by_time(data_prepared_tbl, .start_date = "2018-09"),
keep_data = TRUE) %>%
plot_modeltime_forecast(.conf_interval_alpha = 0.02,
.conf_interval_fill = 'skyblue',
.interactive = FALSE,
.title = "Models fit on testing(splits)",
.y_lab = "sales_trans",
.line_size = 1) +
theme_dark_grey() +
scale_color_todd_dark_bright()
Not bad. But we can tune these models for a better fit before forecasting.
There are two types of cross-validation we’ll need for resampling: K-fold CV for models that can train on randomly ordered data, and time-series CV for models that need to train on sequential, ordered data. You’ll see.
K-Fold CV
set.seed(321)
resamples_kfold <- training(splits) %>% vfold_cv(v = 6)
resamples_kfold %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(order_date,
sales_trans,
.facet_ncol = 2,
.interactive = FALSE,
.title = "K-fold CV Plan",
.y_lab = "sales_trans") +
theme_dark_grey() +
scale_color_manual(values = c("#cccccc", "#db0000"))
Time-Series CV
set.seed(321)
resamples_tscv <- time_series_cv(data = training(splits) %>% drop_na(),
cumulative = TRUE,
assess = "12 weeks",
skip = '6 weeks',
slice_limit = 6)
resamples_tscv %>%
tk_time_series_cv_plan() %>%
plot_time_series_cv_plan(order_date,
sales_trans,
.facet_ncol = 2,
.interactive = FALSE,
.title = "Time-Series CV Plan",
.y_lab = "sales_trans") +
theme_dark_grey() +
scale_color_manual(values = c("#cccccc", "#db0000"))
I did 2-3 rounds of tuning for all 6 models; I’ll show the first and last round for just 2 of the 6. We’ll still compare the performance of all 6 tuned models.
Each tuning round consists of a tunable model spec, a workflow, a tuning grid, a tune_grid() run, a results table, and a results plot.
After each round I narrow the tuning grid based on what the plot shows me, then repeat until additional tuning makes little impact on rmse/rsq. Hint: pay attention to the y-axis of the plots. Tuning a model with many parameters could take ~8 minutes on my 2019 16 GB, 1.6 GHz MacBook Air. I could use an upgrade.
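If runs like that feel slow, {tune} can process resamples in parallel once a backend is registered; a minimal sketch, assuming {doFuture} is installed:
library(doFuture)
registerDoFuture()
plan(multisession, workers = parallel::detectCores() - 1)
# subsequent tune_grid() calls will now run folds in parallel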
Tunable Model
model_spec_svm_tune <- svm_rbf(mode = 'regression',
cost = tune(),
rbf_sigma = tune(),
margin = 0.17 ) %>%
set_engine('kernlab')
Workflow
Change order_date feature to ‘indicator’
wflw_spec_svm_tune <- workflow() %>%
add_model(model_spec_svm_tune) %>%
add_recipe(recipe_spec_no_f %>% update_role(order_date,
new_role = 'indicator'))
Tuning Grid 1
grid_spec_svm_1 <- grid_latin_hypercube(parameters(model_spec_svm_tune),
size = 20)
Tuning Round 1
set.seed(321)
tune_results_svm <- wflw_spec_svm_tune %>%
tune_grid(resamples = resamples_kfold,
grid = grid_spec_svm_1,
control = control_grid(verbose = TRUE))
Results Table 1
tune_results_svm %>% show_best('rmse', n = 10)
## # A tibble: 10 x 8
## cost rbf_sigma .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.142 1.23e-2 rmse standard 0.880 6 0.0328 Preprocesso…
## 2 25.4 3.52e-5 rmse standard 0.883 6 0.0338 Preprocesso…
## 3 9.08 6.84e-2 rmse standard 0.921 6 0.0223 Preprocesso…
## 4 0.0256 6.92e-3 rmse standard 0.972 6 0.0326 Preprocesso…
## 5 4.30 6.81e-6 rmse standard 1.01 6 0.0329 Preprocesso…
## 6 0.0768 3.05e-1 rmse standard 1.03 6 0.0325 Preprocesso…
## 7 0.00592 8.26e-4 rmse standard 1.03 6 0.0327 Preprocesso…
## 8 0.0508 3.77e-1 rmse standard 1.03 6 0.0326 Preprocesso…
## 9 0.0195 1.74e-4 rmse standard 1.03 6 0.0327 Preprocesso…
## 10 18.2 1.75e-7 rmse standard 1.03 6 0.0327 Preprocesso…
Results Plot 1
tune_results_svm %>%
autoplot +
geom_point(aes(color = '#fed9a6'), show.legend = F) +
geom_smooth(se = FALSE, size = 0.5) +
theme_dark_grey()
The range of the rmse and rsq y-axes is wide enough to warrant another round of tuning. Update the grid and do round 2.
Tuning Grid 2
grid_spec_svm_2 <- grid_latin_hypercube(cost(range = c(3.2,3.5),
trans = scales::log2_trans()),
rbf_sigma(range = c(-2.45,-2.35),
trans = scales::log10_trans()),
size = 20)
Tuning Round 2
set.seed(321)
tune_results_svm_2 <- wflw_spec_svm_tune %>%
tune_grid(resamples = resamples_kfold,
grid = grid_spec_svm_2,
control = control_grid(verbose = TRUE))
Results Table 2
tune_results_svm_2 %>% show_best('rmse', n = 10)
## # A tibble: 10 x 8
## cost rbf_sigma .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10.3 0.00359 rmse standard 0.813 6 0.0269 Preprocessor1_M…
## 2 10.7 0.00359 rmse standard 0.813 6 0.0267 Preprocessor1_M…
## 3 9.59 0.00378 rmse standard 0.813 6 0.0269 Preprocessor1_M…
## 4 10.8 0.00364 rmse standard 0.813 6 0.0266 Preprocessor1_M…
## 5 10.2 0.00372 rmse standard 0.813 6 0.0267 Preprocessor1_M…
## 6 10.4 0.00372 rmse standard 0.813 6 0.0267 Preprocessor1_M…
## 7 9.76 0.00383 rmse standard 0.814 6 0.0268 Preprocessor1_M…
## 8 9.20 0.00392 rmse standard 0.814 6 0.0269 Preprocessor1_M…
## 9 9.46 0.00388 rmse standard 0.814 6 0.0268 Preprocessor1_M…
## 10 9.87 0.00396 rmse standard 0.814 6 0.0265 Preprocessor1_M…
Results Plot 2
tune_results_svm_2 %>%
autoplot +
geom_point(aes(color = '#fed9a6'), show.legend = F) +
geom_smooth(se = FALSE, size = 0.5) +
theme_dark_grey()
Good enough. Look at the y-axis. Another round of tuning won’t make a big difference.
Fit Tuned Workflow on Original Training Data
wflw_fit_svm_no_f_tuned <- wflw_spec_svm_tune %>%
finalize_workflow(select_best(tune_results_svm_2, 'rmse')) %>%
fit(training(splits))
Tunable Model
model_spec_nnetar_tune <- nnetar_reg(mode = 'regression',
seasonal_period = 7,
non_seasonal_ar = tune(),
seasonal_ar = tune(),
num_networks = 10,
hidden_units = tune(),
penalty = tune(),
epochs = 50) %>%
set_engine('nnetar')
Workflow
Keep order_date feature
wflw_spec_nnetar_tune_f <- workflow() %>%
add_model(model_spec_nnetar_tune) %>%
add_recipe(recipe_spec)
Tuning Grid 1
grid_spec_nnetar_f_1 <- grid_latin_hypercube(parameters(model_spec_nnetar_tune),
size = 20)
Tuning Round 1
set.seed(321)
tune_results_nnetar_f <- wflw_spec_nnetar_tune_f %>%
tune_grid(resamples = resamples_tscv,
grid = grid_spec_nnetar_f_1,
control = control_grid(verbose = TRUE))
Results Table 1
tune_results_nnetar_f %>% show_best('rmse', n = 10)
## # A tibble: 10 x 10
## non_seasonal_ar seasonal_ar hidden_units penalty .metric .estimator
## <int> <int> <int> <dbl> <chr> <chr>
## 1 4 1 9 6.10e- 1 rmse standard
## 2 2 0 6 4.15e-10 rmse standard
## 3 2 0 5 1.44e- 1 rmse standard
## 4 3 2 7 7.34e- 7 rmse standard
## 5 2 1 6 6.35e- 2 rmse standard
## 6 5 1 7 3.76e- 5 rmse standard
## 7 3 1 2 2.64e- 7 rmse standard
## 8 5 0 5 1.30e- 9 rmse standard
## 9 1 2 1 8.72e- 6 rmse standard
## 10 4 2 8 8.67e- 3 rmse standard
## # … with 4 more variables: mean <dbl>, n <int>, std_err <dbl>,
## # .config <chr>
Results Plot 1
tune_results_nnetar_f %>%
autoplot +
geom_point(aes(color = '#fed9a6'), show.legend = F) +
geom_smooth(se = FALSE, size = 0.5) +
theme_dark_grey()
Tuning Grid 2
set.seed(321)
grid_spec_nnetar_f_2 <- grid_latin_hypercube(non_seasonal_ar(range = c(1,1)),
seasonal_ar(range = c(0,2)),
hidden_units(range = c(5,5)),
penalty(range = c(-6,-3),
trans = scales::log10_trans()),
size = 10)
Tuning Round 2
set.seed(321)
tune_results_nnetar_f_2 <- wflw_spec_nnetar_tune_f %>%
tune_grid(resamples = resamples_tscv,
grid = grid_spec_nnetar_f_2,
control = control_grid(verbose = TRUE))
Results Table 2
tune_results_nnetar_f_2 %>% show_best('rmse', n = 10)
## # A tibble: 10 x 10
## non_seasonal_ar seasonal_ar hidden_units penalty .metric .estimator
## <int> <int> <int> <dbl> <chr> <chr>
## 1 1 2 5 0.000460 rmse standard
## 2 1 1 5 0.00000292 rmse standard
## 3 1 0 5 0.00000541 rmse standard
## 4 1 0 5 0.000160 rmse standard
## 5 1 2 5 0.0000212 rmse standard
## 6 1 0 5 0.0000135 rmse standard
## 7 1 1 5 0.0000865 rmse standard
## 8 1 1 5 0.00000189 rmse standard
## 9 1 2 5 0.000645 rmse standard
## 10 1 1 5 0.0000595 rmse standard
## # … with 4 more variables: mean <dbl>, n <int>, std_err <dbl>,
## # .config <chr>
Results Plot 2
tune_results_nnetar_f_2 %>%
autoplot +
geom_point(aes(color = '#fed9a6'), show.legend = F) +
geom_smooth(se = FALSE, size = 0.5) +
theme_dark_grey()
Fit Tuned Workflow on Original Training Data
wflw_fit_nnetar_f_tuned <- wflw_spec_nnetar_tune_f %>%
finalize_workflow(select_best(tune_results_nnetar_f_2, 'rmse')) %>%
fit(training(splits))
After lots of tuning, we have 6 tuned models fit back onto the training data (the four tuning runs not shown followed the same pattern). Now let’s evaluate their performance on the testing data.
Collect Tuned Models
submodels_tuned_tbl <- modeltime_table(wflw_fit_prophet_boost_tuned,
wflw_fit_svm_f_tuned,
wflw_fit_svm_no_f_tuned,
wflw_fit_rf_tuned,
wflw_fit_nnetar_no_f_tuned,
wflw_fit_nnetar_f_tuned) %>%
update_model_description(1, "PROPHET_XGBOOST-tuned") %>%
update_model_description(2, "SVM-fourier-tuned") %>%
update_model_description(3, "SVM-no_fourier-tuned") %>%
update_model_description(4, "Random Forest-tuned") %>%
update_model_description(5, "NNETAR-no_fourier-tuned") %>%
update_model_description(6, "NNETAR-fourier-tuned")
Calibrate on Test Data
calibration_tuned_tbl <- submodels_tuned_tbl %>% modeltime_calibrate(testing(splits))
Measure Accuracy
calibration_tuned_tbl %>% modeltime_accuracy() %>% arrange(rmse)
## # A tibble: 6 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 SVM-no_fourier-tuned Test 0.371 359. 0.506 57.8 0.543 0.483
## 2 6 NNETAR-fourier-tuned Test 0.403 363. 0.549 63.4 0.559 0.447
## 3 2 SVM-fourier-tuned Test 0.395 381. 0.538 63.3 0.571 0.427
## 4 5 NNETAR-no_fourier-t… Test 0.430 250. 0.586 68.9 0.581 0.427
## 5 4 Random Forest-tuned Test 0.419 304. 0.571 64.5 0.586 0.403
## 6 1 PROPHET_XGBOOST-tun… Test 0.421 376. 0.575 62.2 0.604 0.361
Visualize Fit on Test Data
calibration_tuned_tbl %>%
modeltime_forecast(new_data = testing(splits),
actual_data = filter_by_time(data_prepared_tbl, .start_date = "2018-09"),
keep_data = TRUE) %>%
plot_modeltime_forecast(.conf_interval_alpha = 0.02,
.conf_interval_fill = 'skyblue',
.interactive = FALSE,
.title = "Tuned Models fit on testing(splits)",
.y_lab = "sales_trans",
.line_size = 1) +
theme_dark_grey() +
scale_color_todd_dark_bright()
The hyperparameter tuning helped the performance of some models, at least regarding fit on the testing data. I could keep tuning, but let’s move forward and forecast future sales with these tuned models. We need to refit the models on the entire dataset now, not just the training or testing data. Once refit, we can forecast daily sales for the next 84 days.
Refit on All Data
refit_tbl <- calibration_tuned_tbl %>%
modeltime_refit(data = data_prepared_tbl)
After refitting, we can un-transform the data using those transformation parameters we saved way up top and visualize our models on a true y-axis of sales.
Visualizing the Sales Forecast with the 6 Tuned Models, Zoomed In
refit_tbl %>%
modeltime_forecast(new_data = forecast_tbl,
actual_data = data_prepared_tbl,
keep_data = FALSE) %>%
mutate(across(.value:.conf_hi, .fns = ~ standardize_inv_vec(x = .,
mean = std_mean_sales,
sd = std_sd_sales))) %>%
mutate(across(.value:.conf_hi, .fns = ~ log_interval_inv_vec(x = .,
limit_lower = limit_lower,
limit_upper = limit_upper,
offset = offset))) %>%
filter_by_time(.start_date = '2018-10' ) %>%
plot_modeltime_forecast(.conf_interval_alpha = 0.02,
.conf_interval_fill = 'skyblue',
.interactive = FALSE,
.title = "Tuned Models 3-Month Forecast",
.y_lab = "Daily Sales ($)",
.line_size = 0.8) +
theme_dark_grey() +
scale_color_todd_dark_bright()
Visualizing the Sales Forecast with the 6 Tuned Models, Zoomed Out
refit_tbl %>%
modeltime_forecast(new_data = forecast_tbl,
actual_data = data_prepared_tbl,
keep_data = FALSE) %>%
mutate(across(.value:.conf_hi, .fns = ~ standardize_inv_vec(x = .,
mean = std_mean_sales,
sd = std_sd_sales))) %>%
mutate(across(.value:.conf_hi, .fns = ~ log_interval_inv_vec(x = .,
limit_lower = limit_lower,
limit_upper = limit_upper,
offset = offset))) %>%
#filter_by_time(.start_date = '2018-10' ) %>%
plot_modeltime_forecast(.conf_interval_alpha = 0.02,
.conf_interval_fill = 'skyblue',
.interactive = FALSE,
.title = "Tuned Models 3-Month Forecast",
.y_lab = "Daily Sales ($)",
.line_size = 0.5) +
theme_dark_grey() +
scale_color_todd_dark_bright()
There’s plenty more we could do to improve these models, but the last thing I’ll show here is a quick and easy weighted ensemble. Instead of using all 6 tuned models individually, I wanted to give the better-performing tuned models more weight and combine them into 1 weighted ensemble model, using ensemble_weighted() from {modeltime.ensemble}. My best-fitting model was SVM-no_fourier-tuned; the worst was PROPHET_XGBOOST-tuned. My weights ranged from 10 for the best to 5 for the worst.
Weighted Ensemble Model
calibration_tuned_tbl %>% modeltime_accuracy() %>% arrange(rmse)
## # A tibble: 6 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 SVM-no_fourier-tuned Test 0.371 359. 0.506 57.8 0.543 0.483
## 2 6 NNETAR-fourier-tuned Test 0.403 363. 0.549 63.4 0.559 0.447
## 3 2 SVM-fourier-tuned Test 0.395 381. 0.538 63.3 0.571 0.427
## 4 5 NNETAR-no_fourier-t… Test 0.430 250. 0.586 68.9 0.581 0.427
## 5 4 Random Forest-tuned Test 0.419 304. 0.571 64.5 0.586 0.403
## 6 1 PROPHET_XGBOOST-tun… Test 0.421 376. 0.575 62.2 0.604 0.361
weight_ensemble_tbl <- calibration_tuned_tbl %>%
  # loadings follow .model_id order (1-6), not the rmse ranking above:
  # model 3 (SVM-no_fourier-tuned, best) gets 10; model 1 (PROPHET_XGBOOST-tuned, worst) gets 5
  ensemble_weighted(loadings = c(5, 7, 10, 6, 8, 9)) %>%
  modeltime_table()
Calibrate Accuracy of Weighted Ensemble Model
weight_ensemble_tbl %>% modeltime_accuracy(testing(splits))
## # A tibble: 1 x 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 ENSEMBLE (WEIGHTED)… Test 0.380 334. 0.518 59.8 0.547 0.473
Refit Ensemble on All Data
refit_weight_tbl <- weight_ensemble_tbl %>%
modeltime_refit(data = data_prepared_tbl)
Visualizing Sales Forecast with Ensemble Model
refit_weight_tbl %>%
modeltime_forecast(new_data = forecast_tbl,
actual_data = data_prepared_tbl) %>%
mutate(across(.value, .fns = ~ standardize_inv_vec(x = .,
mean = std_mean_sales,
sd = std_sd_sales))) %>%
mutate(across(.value, .fns = ~ log_interval_inv_vec(x = .,
limit_lower = limit_lower,
limit_upper = limit_upper,
offset = offset))) %>%
#filter_by_time(.start_date = '2018') %>%
plot_modeltime_forecast(.interactive = FALSE,
.title = "Weighted Ensemble Forecast",
.y_lab = "Daily Sales ($)",
.line_size = 0.6) +
theme_dark_grey() +
scale_color_todd_dark_bright()
If you need the actual forecast values, be sure to un-transform and save the data for any individual model or ensemble. These objects hold a lot of information, so the saved files can get quite large.
Collect Un-Transformed Forecast Data
forecast_tbl_final <- refit_weight_tbl %>%
modeltime_forecast(new_data = forecast_tbl,
actual_data = data_prepared_tbl) %>%
mutate(across(.value, .fns = ~ standardize_inv_vec(x = .,
mean = std_mean_sales,
sd = std_sd_sales))) %>%
mutate(across(.value, .fns = ~ log_interval_inv_vec(x = .,
limit_lower = limit_lower,
limit_upper = limit_upper,
offset = offset))) %>%
filter(.key == 'prediction')
forecast_tbl_final
## # A tibble: 84 x 5
## .model_id .model_desc .key .index .value
## <int> <chr> <fct> <date> <dbl>
## 1 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2018-12-31 1847.
## 2 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-01 89.0
## 3 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-02 204.
## 4 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-03 1.37
## 5 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-04 566.
## 6 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-05 788.
## 7 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-06 910.
## 8 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-07 860.
## 9 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-08 874.
## 10 1 ENSEMBLE (WEIGHTED): 6 MODELS prediction 2019-01-09 150.
## # … with 74 more rows
The .value is our daily sales prediction for the corresponding date (.index).
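For example, a sketch for keeping the predictions (and the fitted ensemble itself) on disk; the file names are arbitrary:
forecast_tbl_final %>%
  select(order_date = .index, pred_sales = .value) %>%
  write_csv("superstore_84_day_forecast.csv")
saveRDS(refit_weight_tbl, "refit_weight_ensemble.rds") # the model object can be large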
Our 3-month forecast predicts a reduction in sales over January and February, followed by an increase in March. There’s always the possibility of unexpectedly big spikes in single-day sales, as reflected in the model confidence intervals. Still, it’s more reasonable to expect sales to decrease until March, so the company should prepare supply chains and warehouse inventory, and hire/train new employees, accordingly.
For more nuanced decisions, we could run another version of this pipeline to forecast sales or orders for different categories/segments or for individual states/regions, similar to our EDA toward the beginning of this project.
In a future post I’ll show how to use the {modeltime.ensemble} package to create more sophisticated ensembles. Until then, I hope you’ve gotten a good sense of what forecasting with {modeltime} can do. If you already model with {tidymodels}, {modeltime} is a package worth digging into. Find the complete code @ github.com/TWarczak.